{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # in google colab uncomment this\n",
    "\n",
    "# import os\n",
    "\n",
    "# os.system('apt-get install -y xvfb')\n",
    "# os.system('wget https://raw.githubusercontent.com/yandexdataschool/Practical_DL/fall18/xvfb -O ../xvfb')\n",
    "# os.system('apt-get install -y python-opengl ffmpeg')\n",
    "# os.system('pip install pyglet==1.2.4')\n",
    "# os.system('pip install gym')\n",
    "\n",
    "# prefix = 'https://raw.githubusercontent.com/yandexdataschool/Practical_RL/master/week09_policy_II/'\n",
    "\n",
    "# os.system('wget ' + prefix + 'runners.py')\n",
    "# os.system('wget ' + prefix + 'mujoco_wrappers.py')\n",
    "\n",
    "# print('setup complete')\n",
    "\n",
    "# XVFB will be launched if you run on a server\n",
    "import os\n",
    "if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n",
    "    !bash ../xvfb start\n",
    "    os.environ['DISPLAY'] = ':1'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Implementing Proximal Policy Optimization \n",
    "\n",
    "\n",
    "In this notebook you will be implementing Proximal Policy Optimization algorithm, \n",
    "scaled up version of which was used to train [OpenAI Five](https://openai.com/blog/openai-five/) \n",
    "to [win](https://openai.com/blog/how-to-train-your-openai-five/) against the\n",
    "world champions in Dota 2.\n",
    "You will be solving a continuous control environment on which it may be easier and faster \n",
    "to train an agent, however note that PPO here may not be the best algorithm as, for example,\n",
    "Deep Deterministic Policy Gradient and Soft Actor Critic may be more suited \n",
    "for continuous control environments. To run the environment you will need to install \n",
    "[pybullet-gym](https://github.com/benelot/pybullet-gym) which unlike MuJoCo \n",
    "does not require you to have a license.\n",
    "\n",
    "To install the library:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!git clone https://github.com/benelot/pybullet-gym lib/pybullet-gym\n",
    "!pip install -e lib/pybullet-gym"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The overall structure of the code is similar to the one in the A2C optional homework, but don't worry if you haven't done it, it should be relatively easy to figure it out. \n",
    "First, we will create an instance of the environment. \n",
    "We will normalize the observations and rewards, but before that you will need a wrapper that will \n",
    "write summaries, mainly, the total reward during an episode. You can either use one for `TensorFlow` \n",
    "implemented in `atari_wrappers.py` file from the optional A2C homework, or implement your own. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym \n",
    "import pybulletgym\n",
    "\n",
    "env = gym.make(\"HalfCheetahMuJoCoEnv-v0\")\n",
    "print(\"observation space: \", env.observation_space,\n",
    "      \"\\nobservations:\", env.reset())\n",
    "print(\"action space: \", env.action_space, \n",
    "      \"\\naction_sample: \", env.action_space.sample())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Summaries(gym.Wrapper):\n",
    "  \"\"\" Wrapper to write summaries. \"\"\"\n",
    "  def step(self, action):\n",
    "    # TODO: implement writing summaries\n",
    "    return self.env.step(action)\n",
    "  \n",
    "  def reset(self, **kwargs):\n",
    "    # TODO: implement writing summaries\n",
    "    return self.env.reset(**kwargs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The normalization wrapper will subtract running mean from observations and rewards and divide \n",
    "the resulting quantities by the  running variances."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mujoco_wrappers import Normalize\n",
    "\n",
    "env = Normalize(Summaries(gym.make(\"HalfCheetahMuJoCoEnv-v0\")));\n",
    "env.unwrapped.seed(0);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, you will need to define a model for training. We suggest that you use two separate networks: one for policy\n",
    "and another for value function. Each network should be a 3-layer MLP with 64 hidden units, $\\mathrm{tanh}$ \n",
    "activation function, kernel matrices initialized with orthogonal initializer with parameter $\\sqrt{2}$\n",
    "and biases initialized with zeros. \n",
    "\n",
    "Our policy distribution is going to be multivariate normal with diagonal covariance. \n",
    "The network from above will predict the mean, and the covariance should be represented by a single \n",
    "(learned) vector of size 6 (corresponding to the dimensionality of the action space from above). \n",
    "You should initialize this vector to zero and take the exponent of it to always\n",
    "have a non-negative quantity. \n",
    "\n",
    "Overall the model should return three things: predicted mean of the distribution, variance vector, \n",
    "value function. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import tensorflow as tf\n",
    "# import torch\n",
    "\n",
    "<Define your model here>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This model will be wrapped by a `Policy`. The policy can work in two modes, but in either case \n",
    "it is going to return dictionary with string-type keys. The first mode is when the policy is \n",
    "used to sample actions for a trajectory which will later be used for training. In this case \n",
    "the flag `training` passed to `act` method is `False` and the method should return \n",
    "a `dict` with the following keys: \n",
    "\n",
    "* `\"actions\"`: actions to pass to the environment\n",
    "* `\"log_probs\"`: log-probabilities of sampled actions\n",
    "* `\"values\"`: value function $V^\\pi(s)$ predictions.\n",
    "\n",
    "We don't need to use the values under these keys for training, so all of them should be of type `np.ndarray`.\n",
    "\n",
    "When `training` is `True`, the model is training on a given batch of observations. In this\n",
    "case it should return a `dict` with the following keys\n",
    "\n",
    "* `\"distribution\"`: an instance of multivariate normal distribution (`torch.distributions.MultivariateNormal` or `tf.distributions.MultivariateNormalDiag`)\n",
    "* `\"values\"`: value function $V^\\pi(s)$ prediction.\n",
    "\n",
    "The distinction about the modes comes into play depending on where the policy is used: if it is called from `EnvRunner`, \n",
    "the `training` flag is `False`, if it is called from `PPO`, the `training` flag is `True`. These classed \n",
    "will be described below. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Policy:\n",
    "  def __init__(self, model):\n",
    "    self.model = model\n",
    "    \n",
    "  def act(self, inputs, training=False):\n",
    "    <TODO: Implement policy by calling model>\n",
    "    # Should return a dict."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use `EnvRunner` to perform interactions with an environment with a policy for a fixed number of timesteps. Calling `.get_next()` on a runner will return a trajectory &mdash; dictionary \n",
    "containing keys\n",
    "\n",
    "* `\"observations\"`\n",
    "* `\"rewards\"` \n",
    "* `\"resets\"`\n",
    "* `\"actions\"`\n",
    "* all other keys that you defined in `Policy`,\n",
    "\n",
    "under each of these keys there is a `np.ndarray` of specified length $T$ &mdash; the size of partial trajectory. \n",
    "\n",
    "Additionally, before returning a trajectory this runner can apply a list of transformations. \n",
    "Each transformation is simply a callable that should modify passed trajectory in-place."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class AsArray:\n",
    "  \"\"\" \n",
    "  Converts lists of interactions to ndarray.\n",
    "  \"\"\"\n",
    "  def __call__(self, trajectory):\n",
    "    # Modify trajectory inplace. \n",
    "    for k, v in filter(lambda kv: kv[0] != \"state\",\n",
    "                       trajectory.items()):\n",
    "      trajectory[k] = np.asarray(v)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from runners import EnvRunner\n",
    "\n",
    "class DummyPolicy:\n",
    "  def act(self, inputs, training=False):\n",
    "    assert not training\n",
    "    return {\"actions\": np.random.randn(6), \"values\": np.nan}\n",
    "  \n",
    "runner = EnvRunner(env, DummyPolicy(), 3,\n",
    "                   transforms=[AsArray()])\n",
    "trajectory = runner.get_next()\n",
    "\n",
    "{k: v.shape for k, v in trajectory.items() if k != \"state\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You will need to implement the following two transformations. \n",
    "\n",
    "The first is `GAE` that implements [Generalized Advantage Estimator](https://arxiv.org/abs/1506.02438).\n",
    "In it you should add two keys to the trajectory: `\"advantages\"` and `\"value_targets\"`. In GAE the advantages\n",
    "$A_t^{\\mathrm{GAE}(\\gamma,\\lambda)}$ are essentially defined as the exponential \n",
    "moving average with parameter $\\lambda$ of the regular advantages \n",
    "$\\hat{A}^{(n)}(s_t) = \\sum_{l=0}^{T-1} \\gamma^l r_{t+l} + \\gamma^{T} V^\\pi(s_{t+l}) - V^\\pi(s_t)$. \n",
    "The exact formula for the computation is the following\n",
    "\n",
    "$$\n",
    "A_t^{\\mathrm{GAE}(\\gamma,\\lambda)} = \\sum_{l=0}^{T-1} (\\gamma\\lambda)^l\\delta_{t + l}^V,\n",
    "$$\n",
    "where $\\delta_{t+l}^V = r_{t+l} + \\gamma V^\\pi(s_{t+l+1}) - V^\\pi(s_{t+l})$. You can look at the \n",
    "derivation (formulas 11-16) in the paper. Don't forget to reset the summation on terminal\n",
    "states as determined by the flags `trajectory[\"resets\"]`. You can use `trajectory[\"values\"]`\n",
    "to get values of all observations except the most recent which is stored under \n",
    " `trajectory[\"state\"][\"latest_observation\"]`. For this observation you will need to call the policy \n",
    " to get the value prediction.\n",
    "\n",
    "Once you computed the advantages, you can get the targets for training the value function by adding \n",
    "back values:\n",
    "$$\n",
    "\\hat{V}(s_{t+l}) = A_{t+l}^{\\mathrm{GAE}(\\gamma,\\lambda)} + V(s_{t + l}),\n",
    "$$\n",
    "where $\\hat{V}$ is a tensor of value targets that are used to train the value function. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class GAE:\n",
    "  \"\"\" Generalized Advantage Estimator. \"\"\"\n",
    "  def __init__(self, policy, gamma=0.99, lambda_=0.95):\n",
    "    self.policy = policy\n",
    "    self.gamma = gamma\n",
    "    self.lambda_ = lambda_\n",
    "    \n",
    "  def __call__(self, trajectory):\n",
    "    <TODO: implement>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The main advantage of PPO over simpler policy based methods like A2C is that it is possible\n",
    "to train on the same trajectory for multiple gradient steps. The following class wraps \n",
    "an `EnvRunner`. It should call the runner to get a trajectory, then return minibatches \n",
    "from it for a number of epochs, shuffling the data before each epoch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class TrajectorySampler:\n",
    "  \"\"\" Samples minibatches from trajectory for a number of epochs. \"\"\"\n",
    "  def __init__(self, runner, num_epochs, num_minibatches, transforms=None):\n",
    "    self.runner = runner\n",
    "    self.num_epochs = num_epochs\n",
    "    self.num_minibatches = num_minibatches\n",
    "    self.transforms = transforms or []\n",
    "    self.minibatch_count = 0\n",
    "    self.epoch_count = 0\n",
    "    self.trajectory = None\n",
    "    \n",
    "  def shuffle_trajectory(self):\n",
    "    \"\"\" Shuffles all elements in trajectory.\n",
    "    \n",
    "    Should be called at the beginning of each epoch.\n",
    "    \"\"\"\n",
    "    <TODO: implement>\n",
    "    \n",
    "  def get_next(self):\n",
    "    \"\"\" Returns next minibatch.  \"\"\"\n",
    "    <TODO: implement>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A common trick to use with GAE is to normalize advantages, the following transformation does that. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class NormalizeAdvantages:\n",
    "  \"\"\" Normalizes advantages to have zero mean and variance 1. \"\"\"\n",
    "  def __call__(self, trajectory):\n",
    "    adv = trajectory[\"advantages\"]\n",
    "    adv = (adv - adv.mean()) / (adv.std() + 1e-8)\n",
    "    trajectory[\"advantages\"] = adv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can create our PPO runner. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def make_ppo_runner(env, policy, num_runner_steps=2048,\n",
    "                    gamma=0.99, lambda_=0.95, \n",
    "                    num_epochs=10, num_minibatches=32):\n",
    "  \"\"\" Creates runner for PPO algorithm. \"\"\"\n",
    "  runner_transforms = [AsArray(),\n",
    "                       GAE(policy, gamma=gamma, lambda_=lambda_)]\n",
    "  runner = EnvRunner(env, policy, num_runner_steps, \n",
    "                     transforms=runner_transforms)\n",
    "  \n",
    "  sampler_transforms = [NormalizeAdvantages()]\n",
    "  sampler = TrajectorySampler(runner, num_epochs=num_epochs, \n",
    "                              num_minibatches=num_minibatches,\n",
    "                              transforms=sampler_transforms)\n",
    "  return sampler"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the next cell you will need to implement Proximal Policy Optimization algorithm itself. The algorithm\n",
    "modifies the typical policy gradient loss in the following way:\n",
    "\n",
    "$$\n",
    "L_{\\pi} = \\frac{1}{T}\\sum_{l=0}^{T-1}\n",
    "\\frac{\\pi_\\theta(a_{t+l}|s_{t+l})}{\\pi_\\theta^{\\text{old}}(a_{t+l}|s_{t+l})}\n",
    "A^{\\mathrm{GAE}(\\gamma,\\lambda)}_{t+l}\\\\\n",
    "L_{\\pi}^{\\text{clipped}} = \\frac{1}{T}\\sum_{l=0}^{T-1}\\mathrm{clip}\\left(\n",
    "\\frac{\\pi_\\theta(a_{t+l}|s_{t+l})}{\\pi_{\\theta^{\\text{old}}}(a_{t+l}|s_{t+l})}\n",
    "\\cdot A^{\\mathrm{GAE(\\gamma, \\lambda)}}_{t+l},\n",
    "1 - \\text{cliprange}, 1 + \\text{cliprange}\\right)\\\\\n",
    "L_{\\text{policy}} = -\\min\\left(L_\\pi, L_{\\pi}^{\\text{clipped}}\\right).\n",
    "$$\n",
    "\n",
    "Additionally, the value loss is modified in the following way:\n",
    "\n",
    "$$\n",
    "L_V = \\frac{1}{T}\\sum_{l=0}^{T-1}(V_\\theta(s_{t+l}) - \\hat{V}(s_{t+l}))^2\\\\\n",
    "L_{V}^{\\text{clipped}} = \\frac{1}{T}\\sum_{l=0}^{T-1}\n",
    "V_{\\theta^{\\text{old}}}(s_{t+l}) +\n",
    "\\text{clip}\\left(\n",
    "V_\\theta(s_{t+l}) - V_{\\theta^\\text{old}}(s_{t+l}),\n",
    "-\\text{cliprange}, \\text{cliprange}\n",
    "\\right)\\\\\n",
    "L_{\\text{value}} = -\\min\\left(L_V, L_V^{\\text{clipped}}\\right).\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class PPO:\n",
    "  def __init__(self, policy, optimizer,\n",
    "               cliprange=0.2,\n",
    "               value_loss_coef=0.25,\n",
    "               max_grad_norm=0.5):\n",
    "    self.policy = policy\n",
    "    self.optimizer = optimizer\n",
    "    self.cliprange = cliprange\n",
    "    self.value_loss_coef = value_loss_coef\n",
    "    # Note that we don't need entropy regularization for this env.\n",
    "    self.max_grad_norm = max_grad_norm\n",
    "    \n",
    "  def policy_loss(self, trajectory, act):\n",
    "    \"\"\" Computes and returns policy loss on a given trajectory. \"\"\"\n",
    "    <TODO: implement>\n",
    "      \n",
    "  def value_loss(self, trajectory, act):\n",
    "    \"\"\" Computes and returns value loss on a given trajectory. \"\"\"\n",
    "    <TODO: implement>\n",
    "      \n",
    "  def loss(self, trajectory):\n",
    "    act = self.policy.act(trajectory[\"observations\"], training=True)\n",
    "    policy_loss = self.policy_loss(trajectory, act)\n",
    "    value_loss = self.value_loss(trajectory, act)\n",
    "    return policy_loss + self.value_loss_coef * value_loss\n",
    "      \n",
    "  def step(self, trajectory):\n",
    "    \"\"\" Computes the loss function and performs a single gradient step. \"\"\"\n",
    "    <TODO: implement>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now everything is ready to do training. In one million of interactions it should be possible to \n",
    "achieve the total raw reward of about 1500. You should plot this quantity with respect to \n",
    "`runner.step_var` &mdash; the number of interactions with the environment. It is highly \n",
    "encouraged to also provide plots of the following quantities (these are useful for debugging as well):\n",
    "\n",
    "* [Coefficient of Determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) between \n",
    "value targets and value predictions\n",
    "* Entropy of the policy $\\pi$\n",
    "* Value loss\n",
    "* Policy loss\n",
    "* Value targets\n",
    "* Value predictions\n",
    "* Gradient norm\n",
    "* Advantages\n",
    "\n",
    "For optimization it is suggested to use Adam optimizer with linearly annealing learning rate \n",
    "from 3e-4 to 0 and epsilon 1e-5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
